CORILGA: a Galician Multilevel Annotated Speech Corpus for Linguistic Analysis

Authors

Carmen García-Mateo

Antonio Cardenal López

Xose Luis Regueira

Elisa Fernández Rei

Marta Martinez

Roberto Seara

Rocío Varela

Noemí Basanta

Abstract

This paper describes CORILGA (“Corpus Oral Informatizado da Lingua Galega”), a large, high-quality corpus of spoken Galician from the 1960s to the present day. It includes both formal and informal spoken language from standard and non-standard varieties, across different generations and social levels. The corpus will be available to the research community upon completion. Galician is one of the EU languages that needs further research before highly effective language technology solutions can be implemented. A software repository for speech resources in Galician is also described. The repository includes a structured database, a graphical interface and processing tools. The database enables simple, fast searches based on a number of different criteria. The web-based user interface gives users easy access to the different materials. Last but not least, a set of transcription-based modules for automatic speech recognition has been developed, thus facilitating the orthographic labelling of the recordings.

Similar resources

Enhanced CORILGA: Introducing the Automatic Phonetic Alignment Tool for Continuous Speech

The Corpus Oral Informatizado da Lingua Galega (CORILGA) project aims at building a corpus of oral language for Galician, primarily designed for studying linguistic variation and change. The project is currently under development and is periodically enriched with new contributions. The long-term goal is for all the speech recordings to be enriched with phonetic, syllabic, morphosyntactic...

Semi-Automatic Phonological Annotations of Speech by Grammatical Inference

This paper describes a technique for automatically generating multiple levels of linguistic annotation for a corpus of speech utterances. Using a training corpus of multilevel annotations, a corresponding finite-state representation is automatically constructed by grammatical inference. This finite-state description is then employed as a knowledge component to automatically generate a new multi...

A Galician Textual Corpus for Morphosyntactic Tagging with Application to Text-to-Speech Synthesis

This paper presents the morphosyntactic tagger and the corpus of contemporary written Galician which are being employed in the development of the Galician version of our text-to-speech synthesizer. Their quality and accuracy make them useful for speech technology applications and turn them into possible references for further investigation and research projects on the Galician language. In es...

Análisis morfosintáctico estadístico en lengua gallega

This paper describes a morphosyntactic analyser for Galician which, apart from its obvious linguistic interest, can be easily applied to speech recognition and speech synthesis systems. While rule-driven models perform better, stochastic models have shown comparable accuracy when properly designed. Moreover, rule-driven models are based on a complex set of linguistic rules, qui...

Specific features of the Galician language and implications for speech technology development

In this article we present the main linguistic and phonetic features of Galician which need to be considered in the development of speech technology applications for this language. We also describe the solutions adopted in our text-to-speech system, also useful for speech recognition and speech-to-speech translation. On the phonetic plane in particular, the handling of vocal contact and the det...

Publication date: 2014
